change the create dist job function to support creating a single nod… #240
Merged
hemildesai merged 3 commits on Jun 5, 2025
…e job and distributed jobs Signed-off-by: Francisco Delgado <fdelgadolope@fdelgadolope-mlt.client.nvidia.com>
Signed-off-by: Francisco Delgado <fdelgadolope@fdelgadolope-mlt.client.nvidia.com>
Modified the formatting to meet the repository requirements.
roclark
previously approved these changes
May 22, 2025
roclark
left a comment
LGTM! Do we also want to add a test for launching a single-node job to ensure it goes down the single-node path in create_training_job? Thanks for putting this together!
Signed-off-by: Francisco Delgado <fdelgadolope@fdelgadolope-mlt.client.nvidia.com>
Awesome work, thanks!!
hemildesai
approved these changes
May 28, 2025
This pull request refactors the `create_distributed_job` method into a more general-purpose `create_training_job` method in the `DGXCloudExecutor` class, so that one method handles both single-node and multi-node jobs. It also updates the related test cases to align with the new method.

Closes #239

Core functionality changes:
- Renamed `create_distributed_job` to `create_training_job` in `nemo_run/core/execution/dgxcloud.py`, adding support for both single-node and multi-node training jobs on DGX Cloud. The method now validates inputs, determines the appropriate endpoint based on node count, and constructs payloads accordingly.
- Updated the `launch` method to call the new `create_training_job` method instead of the old `create_distributed_job`.

Test updates:
- Renamed `test_create_distributed_job` to `test_create_training_job` in `test/core/execution/test_dgxcloud.py` to reflect the new method name.
- Updated tests that mocked `create_distributed_job` to mock `create_training_job` instead, ensuring compatibility with the refactored method.
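For readers who want a picture of the node-count branching described above, here is a minimal sketch. It is not the code from this PR: the function name `build_training_request`, the two endpoint paths, and the payload field names (`gpusPerNode`, `numWorkers`) are all hypothetical placeholders chosen for illustration, and the real `create_training_job` in `DGXCloudExecutor` may differ in every detail.

```python
from typing import Any

# Hypothetical endpoint paths, used only for illustration.
SINGLE_NODE_ENDPOINT = "/workloads/trainings"
MULTI_NODE_ENDPOINT = "/workloads/distributed"


def build_training_request(
    base_url: str,
    name: str,
    image: str,
    command: str,
    nodes: int,
    gpus_per_node: int,
) -> tuple[str, dict[str, Any]]:
    """Return (url, payload) for a single-node or multi-node training job."""
    # Validate inputs before building anything.
    if not name:
        raise ValueError("job name must not be empty")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    if gpus_per_node < 0:
        raise ValueError("gpus_per_node must not be negative")

    # Fields shared by both job types (field names are assumptions).
    spec: dict[str, Any] = {
        "image": image,
        "command": command,
        "gpusPerNode": gpus_per_node,
    }

    # Pick the endpoint from the node count; only multi-node jobs
    # carry a worker count in this sketch.
    if nodes > 1:
        endpoint = MULTI_NODE_ENDPOINT
        spec["numWorkers"] = nodes
    else:
        endpoint = SINGLE_NODE_ENDPOINT

    payload = {"name": name, "spec": spec}
    return base_url.rstrip("/") + endpoint, payload
```

A single-node check along the lines roclark suggested could then assert that `nodes=1` selects the single-node endpoint and omits the worker count, while `nodes=2` takes the distributed path.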